This report aims to analyze video game sales data from 2006 to 2010, sourced from a dataset available on the Kaggle platform. Kaggle is renowned for its rich data sources frequently utilized by data analysts and researchers for various analytical projects. The report provides comprehensive insights into video game sales trends, genre popularity, regional sales performance, and an analysis of leading platforms and publishers during the specified period.
The structure of this report includes several main sections:
The main topic of this report is the analysis of video game sales exceeding 100,000 copies from 2006 to 2010. Video game sales are a crucial indicator reflecting the popularity and success of games in the market.
The issues addressed in this report include:
This analysis is significant as it provides a deep understanding of the dynamics of the gaming industry from 2006 to 2010. The video game industry is one of the most dynamic and rapidly evolving entertainment sectors. By understanding sales trends, genre popularity, and regional performance, game developers, publishers, and marketers can make more informed and strategic decisions. This analysis is valuable not only for understanding the current market conditions but also for predicting future trends and identifying new opportunities.
This analysis is relevant to various stakeholders interested in the dynamics of the gaming industry. The findings can be utilized for:
This report is intended for:
The main objectives of this analysis are to answer several key questions:
This analysis employs Exploratory Data Analysis (EDA) methods, including:
Exploratory Data Analysis (EDA):
Is an approach to analyzing datasets with the aim of summarizing their
main characteristics using visual methods. This technique includes data
visualization using graphs such as histograms, scatter plots, and box
plots to understand data distribution, relationships between variables,
and trends over time. Statistical analysis is also applied to measure
correlations between variables, convey the significance of findings, and
test hypotheses to understand the data more deeply before it is formally
modeled [1].
The Kruskal-Wallis test:
Is used as a non-parametric alternative to the ANOVA test, useful for
comparing the medians of three or more independent groups of data. In
the context of video game sales analysis, this test is used to see
whether there are significant differences in sales between regions or
certain game genres [2].
The Chi-Square test:
Is a statistical technique for testing the relationship between two
categorical variables. In video game sales analysis, this test is used
to determine whether there is a relationship between game genre
preferences and market region, or whether there is a dependency between
the popularity of a particular platform and game sales by genre
[3].
Some assumptions and limitations of this analysis include:
Data Accuracy: The data used is sourced from Kaggle for the period 2006-2010. While Kaggle is known for rich datasets, it may not encompass all recent changes in the video game industry. For example, trends from 2017-2020 may not be well-reflected due to limited data availability for those years.
Time Constraints: The analysis is constrained by the timeframe of data from 2006 to 2010. Significant changes in market preferences and technology in the gaming industry after 2010 may not be fully captured in this report.
Lack of Recent Data: Despite efforts to incorporate newer data, availability for years 2017-2020 has been limited or inadequate in the available sources. This may affect the balance of representation in our visualizations and analysis.
Industry Dynamics: The gaming industry is a highly dynamic market with rapid changes in technology, consumer preferences, and global trends. These factors can influence the analysis outcomes based on historical data used in this report.
Interpretation of Results: The results of this
analysis should be understood in the context of the timeframe and data
used. Interpretations regarding genre preferences, platforms, and
business strategies should be adjusted for contextual changes and
current market dynamics, which may differ from the period of
2006-2010.
This analysis is worth reading as it provides actionable insights into business strategies within the gaming industry. For us, this analysis is compelling as it combines our interest in data and the gaming industry, offering an opportunity to apply data analysis skills in a real-world context. Understanding trends and preferences in the gaming industry can be crucial for success across various aspects of the video game business. This analysis also provides an opportunity to delve deeper into how specific factors impact game success, which can be highly valuable to anyone involved in this dynamic industry.
The dataset contains information on video game sales exceeding 100,000 copies. It was sourced from Kaggle, which originally obtained the data from vgchartz.com. The dataset spans from 1980 to 2020, but the analysis focuses on the years 2006 to 2010.
1. Name:
2. Platform:
3. Year:
4. Genre:
5. Publisher:
6. NA_Sales, EU_Sales, JP_Sales, Other_Sales, Global_Sales:
| Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1 | Length:16598 | Length:16598 | Length:16598 | Length:16598 | Length:16598 | Min. : 0.0000 | Min. : 0.0000 | Min. : 0.00000 | Min. : 0.00000 | Min. : 0.0100 |
| 1st Qu.: 4151 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 0.0000 | 1st Qu.: 0.0000 | 1st Qu.: 0.00000 | 1st Qu.: 0.00000 | 1st Qu.: 0.0600 |
| Median : 8300 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median : 0.0800 | Median : 0.0200 | Median : 0.00000 | Median : 0.01000 | Median : 0.1700 |
| Mean : 8301 | | | | | | Mean : 0.2647 | Mean : 0.1467 | Mean : 0.07778 | Mean : 0.04806 | Mean : 0.5374 |
| 3rd Qu.:12450 | | | | | | 3rd Qu.: 0.2400 | 3rd Qu.: 0.1100 | 3rd Qu.: 0.04000 | 3rd Qu.: 0.04000 | 3rd Qu.: 0.4700 |
| Max. :16600 | | | | | | Max. :41.4900 | Max. :29.0200 | Max. :10.22000 | Max. :10.57000 | Max. :82.7400 |
The summary covers 16,598 entries. Rank is numeric, while game name, platform, release year, genre, and publisher are stored as character data. Sales for the four regions (North America, Europe, Japan, and other regions) and global sales are reported in millions of copies. Rank runs from 1 to 16,600 with a median and mean of about 8,300, which is expected for a sequential ranking rather than an informative distribution. Every sales column is strongly right-skewed: the global mean (0.5374 million) is roughly three times the global median (0.17 million), and the maximum reaches 82.74 million, so a small number of blockbuster titles accounts for a large share of sales. Regional medians also differ (0.08 million in North America, 0.02 million in Europe, and 0 in Japan), which points to varied market preferences.
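The gap between mean and median described in the summary can be checked directly. The following is a minimal sketch on a small made-up vector; the values and the names `sales`, `sales_mean`, `sales_median`, and `is_right_skewed` are assumptions for illustration and are not taken from the dataset.

```r
# Hypothetical sales figures in millions of copies (illustrative only,
# not taken from the dataset): many small titles and one blockbuster.
sales <- c(0.01, 0.03, 0.06, 0.10, 0.17, 0.25, 0.47, 1.20, 82.74)

sales_mean   <- mean(sales)
sales_median <- median(sales)

# A mean far above the median signals a right-skewed distribution.
is_right_skewed <- sales_mean > sales_median

# Same five-number-plus-mean layout as the summary table.
summary(sales)
```

On real data the same comparison could be applied to each sales column in turn.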
The preprocessing steps applied to the dataset, along with the reason for each, are as follows:
```r
# Visualizations
library(ggplot2)
library(corrplot)
library(reshape2)
library(tidyr)
library(knitr)
library(plotly)
library(RColorBrewer)

# Data Manipulation
library(dplyr)
```
Libraries Overview:
| Library | Description |
|---|---|
| ggplot2 | A core library for creating customizable data visualizations in R. It uses a layered graphics grammar to create plots step-by-step, offering flexibility. |
| corrplot | Used to visualize correlation matrices in R with various visualization methods. |
| reshape2 | R library for restructuring data between wide and long formats, facilitating data analysis and visualization. |
| tidyr | Enables data transformations such as reformatting, cleaning, and preparing data for analysis, adhering to tidy data principles. |
| knitr | Essential for generating dynamic reports in R, supports Markdown and HTML-based documentation. |
| plotly | Enables creation of interactive web-based visualizations directly from R, allowing dynamic and shareable charts. |
| RColorBrewer | A package that provides a collection of well-designed color schemes for data visualization in R. |
| dplyr | Provides tools for efficient dataset manipulation in R, including filtering, summarizing, and transforming data, crucial for data preprocessing. |
```r
data <- dataVG[!is.na(dataVG$Year) & dataVG$Year != "N/A", ]
```
Reason: Removing rows where the Year column is missing (NA) or holds the string “N/A” ensures that only complete and valid records are used. Missing data can introduce bias and reduce the reliability of the results, so removing these rows keeps the analysis based on accurate information.
```r
data <- data[data$Year %in% c("2006", "2007", "2008", "2009", "2010"), ]
```
Reason: Restricting the rows to 2006, 2007, 2008, 2009, and 2010 focuses the analysis on the period under study. Discarding years outside this window reduces noise and keeps the results relevant to the questions this report addresses.
```r
data$Year <- factor(data$Year)
```
Reason: Converting the Year column to a factor treats each year as a category rather than a text string. This makes it straightforward to group by year, build contingency tables, and draw bar charts showing how the data are distributed across years.

Together, these preprocessing steps make the data cleaner, more focused, and ready for further analysis. Good preprocessing helps minimize errors and bias when interpreting results and makes it easier to derive meaningful insights from the data.
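The three preprocessing steps can be traced end to end on a toy data frame. This is a sketch under stated assumptions: `dataVG_demo`, `demo`, and all values in them are hypothetical stand-ins for the real data, which is not reproduced here.

```r
# Toy stand-in for the raw data; `dataVG_demo` and its values are hypothetical.
dataVG_demo <- data.frame(
  Name = c("Game A", "Game B", "Game C", "Game D", "Game E"),
  Year = c("2005", "2006", "N/A", "2010", NA),
  Global_Sales = c(1.20, 0.35, 0.10, 2.40, 0.05),
  stringsAsFactors = FALSE
)

# Step 1: drop rows whose Year is missing (NA) or the literal string "N/A".
demo <- dataVG_demo[!is.na(dataVG_demo$Year) & dataVG_demo$Year != "N/A", ]

# Step 2: keep only the 2006-2010 window.
demo <- demo[demo$Year %in% as.character(2006:2010), ]

# Step 3: treat Year as a categorical variable.
demo$Year <- factor(demo$Year)

# Inspect the result: structure and number of rows per year.
str(demo)
table(demo$Year)
```

Checking `is.na()` before the string comparison matters: comparing `NA` with `"N/A"` yields `NA`, and combining it with `FALSE` through `&` gives `FALSE`, so the row is dropped rather than kept as an all-`NA` row.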
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Key Observations
Insights
Definition
The Chi-Square test is a statistical procedure that compares observed data with the data expected under a null hypothesis. Its goal is to identify whether a disparity between observed and expected frequencies is due to chance or to a relationship between the variables under consideration [4]. This makes it a natural choice for interpreting the connection between two categorical variables in our data.
Why do we use Chi Square Test?
The Chi-Square test has several advantages: it is robust with respect to the distribution of the data, it is easy to compute, and it yields detailed information. It can also be used in studies where parametric assumptions cannot be met, and it handles data from both two-group and multiple-group studies.
Chi Square Test Formula
\[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]
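Where \(O_i\) is the observed frequency in category \(i\) and \(E_i\) is the frequency expected in that category under the null hypothesis.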
The degrees of freedom represent the number of values in a calculation that are free to vary. They determine which chi-square distribution the test statistic is compared against, so they must be calculated correctly for the test to be valid. Chi-square tests are frequently used to compare observed data with the data that would be expected if a particular hypothesis were true [5].
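To make the formula concrete, the statistic can be computed by hand and compared with R's built-in `chisq.test()`. The sketch below assumes a small made-up genre-by-region table; the counts and the names `observed`, `expected`, `chi_sq`, `dof`, and `builtin` are illustrative and do not come from the dataset. For an r x c contingency table the degrees of freedom are (r - 1)(c - 1).

```r
# Hypothetical genre-by-region counts (illustrative only).
observed <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(Genre = c("Action", "Sports"),
                  Region = c("NA", "JP"))
)

# Expected counts under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum over cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for an r x c table: (r - 1) * (c - 1).
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value from the chi-square distribution.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check against the built-in test (continuity correction disabled
# so that it matches the plain formula on a 2 x 2 table).
builtin <- chisq.test(observed, correct = FALSE)
```

The manual statistic and p-value should agree with `builtin$statistic` and `builtin$p.value`. With `correct = TRUE` (the default for 2 x 2 tables) R applies Yates' continuity correction and the two would differ.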
| Test | ChiSquareStatistic | PValue |
|---|---|---|
| Year | 98.11726 | 0 |
| Genre | 1745.53856 | 0 |
| Platform | 5331.36030 | 0 |
| Publisher | 377.22834 | 0 |
Chi Square Result:
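The chi-square results in the table above show a p-value that rounds to 0 for year, genre, platform, and publisher. Because each p-value is far below 0.05, the null hypothesis is rejected in every case: the distribution of games differs significantly across years, genres, platforms, and publishers.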
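A p-value printed as 0 is a rounding effect and does not mean the probability is exactly zero. One way to report such values is sketched below with base R's `format.pval()`; the p-values and the names `p_values`, `p_labels`, and `reject_null` are hypothetical and are not results from this report.

```r
# Hypothetical p-values, including one far too small to print in full.
p_values <- c(test_a = 1e-20, test_b = 0.0312, test_c = 0.437)

# Values below `eps` are reported as "less than eps" instead of a bare 0.
p_labels <- format.pval(p_values, digits = 3, eps = 0.001)

# Decision at the 5% level: reject the null hypothesis when p < 0.05.
reject_null <- p_values < 0.05

data.frame(p_label = p_labels, reject_null = reject_null)
```

Reporting the threshold rather than 0 keeps the direction of the decision visible: a very small p-value leads to rejecting the null hypothesis.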
Definition
The Kruskal–Wallis test is a statistical test used to compare two or
more groups for a continuous or discrete variable. It is a
non-parametric test, meaning that it assumes no particular distribution
of your data and is analogous to the one-way analysis of variance
(ANOVA) [6].
Why do we use Kruskal-Wallis Test?
The Kruskal-Wallis test and other non-parametric (or distribution-free) tests are useful for testing hypotheses when the assumption of normality does not hold. They do not require the data to follow a particular distribution, which also makes them useful when a dataset is small. The method suits our data because the sales figures are strongly right-skewed (the global mean of 0.5374 million is well above the median of 0.17 million), so normality cannot be assumed.
Kruskal-Wallis Test Formula
Let’s say the null hypothesis is true and thus there is no difference
between the independent samples. Then high and low ranks are randomly
distributed across the samples and should be equally distributed across
the groups. Therefore, the probability that a rank is assigned to a
group is the same for all groups [7].
If there is no difference between the groups, the mean value of the ranks should also be the same in all groups. The expected value of the ranks for each group is then given by
\[E_R = \frac{n+1}{2}\]
Each sample has the same expected rank, which corresponds to the expected value for the population. The variance of the ranks is also needed and can be calculated with the following formula:
\[\sigma^2 = \frac{n^2 - 1}{12}\]
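In the Kruskal-Wallis test, the test statistic H is calculated. Under the null hypothesis, H approximately follows a \(\chi^2\) distribution with \(k - 1\) degrees of freedom, where \(k\) is the number of groups. With \(n_i\) the size of group \(i\) and \(\bar{R}_i\) its mean rank, H is given by:
\[H = \frac{n - 1}{n} \sum_{i=1}^k \frac{n_i (\bar{R}_i - E_R)^2}{\sigma^2}\]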
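The calculation of H can be followed step by step and compared with R's built-in `kruskal.test()`. This is a sketch on made-up data: the sales values, group labels, and variable names are assumptions for illustration. The sum weights each group's squared deviation by the group size and uses the group's mean rank, and the example avoids ties because the variance formula above carries no tie correction.

```r
# Hypothetical sales (millions) for three groups; values are illustrative
# and chosen without ties.
sales <- c(0.12, 0.45, 0.30,   # group A
           1.10, 0.80, 0.95,   # group B
           0.05, 0.20, 0.60)   # group C
group <- factor(rep(c("A", "B", "C"), each = 3))

n     <- length(sales)
ranks <- rank(sales)

# Expected rank and rank variance under the null hypothesis.
E_R    <- (n + 1) / 2
sigma2 <- (n^2 - 1) / 12

# Per-group sizes and mean ranks.
n_i    <- tapply(ranks, group, length)
Rbar_i <- tapply(ranks, group, mean)

# H statistic: group-size-weighted squared deviations of the mean ranks
# from the expected rank, scaled by (n - 1) / n.
H <- (n - 1) / n * sum(n_i * (Rbar_i - E_R)^2 / sigma2)

# p-value from the chi-square approximation with k - 1 degrees of freedom.
k       <- nlevels(group)
p_value <- pchisq(H, df = k - 1, lower.tail = FALSE)

# Cross-check against R's built-in implementation.
builtin <- kruskal.test(sales ~ group)
```

When the data contain ties, `kruskal.test()` applies a correction, so the hand-computed value would no longer match exactly.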
| Variable | Statistic | P_value |
|---|---|---|
| Global Sales | 9.0000 | 0.437 |
| NA Sales | 9.0000 | 0.437 |
| EU Sales | 9.0000 | 0.437 |
| JP Sales | 9.0000 | 0.437 |
| Other Sales | 9.0000 | 0.437 |
| Platform | 871.4586 | 0.000 |
Kruskal Wallis Test Result:
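For the five sales variables, the p-value (0.437) is greater than 0.05, so we fail to reject the null hypothesis: there is no statistically significant difference between the medians for these sales variables. For Platform, the p-value rounds to 0.000, which is below 0.05, so the null hypothesis is rejected and sales differ significantly across platforms.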
| Variable | Statistic | P_value |
|---|---|---|
| Global Sales | 9 | 0.437 |
| NA Sales | 9 | 0.437 |
| EU Sales | 9 | 0.437 |
| JP Sales | 9 | 0.437 |
| Other Sales | 9 | 0.437 |
Since the p-value is greater than 0.05, we fail to reject the null hypothesis. The null hypothesis states that there is no statistically significant difference between the medians of sales across publishers.
Sales Patterns and Regional Preferences:
Regional Impact on Global Sales:
Annual Sales Trends:
Platform and Genre Dominance:
Player Content and Preferences:
Publisher Dominance and Global Strategy:
Implications for the Game Industry Strategy:
[1] A. Verma, “Exploratory Data Analysis and Visualization Techniques in Data Science,” Analytics Vidhya, Aug. 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/08/exploratory-data-analysis-and-visualization-techniques-in-data-science/. [Accessed: Jun. 18, 2024].
[2] “Kruskal-Wallis H Test using SPSS Statistics,” Laerd Statistics, [Online]. Available: https://statistics.laerd.com/spss-tutorials/kruskal-wallis-h-test-using-spss-statistics.php. [Accessed: Jun. 18, 2024].
[3] “Chi-Square Test,” Simplilearn, [Online]. Available: https://www.simplilearn.com/tutorials/statistics-tutorial/chi-square-test. [Accessed: Jun. 18, 2024].
[4] “Chi-Square Test,” Simplilearn, [Online]. Available: https://www.simplilearn.com/tutorials/statistics-tutorial/chi-square-test. [Accessed: Jun. 18, 2024].
[5] A. M. Luna et al., “Advantages of the Chi-Square Test,” PubMed, Jul. 2013. [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/23894860/. [Accessed: Jun. 18, 2024].
[6] “The Kruskal-Wallis Test,” Technology Networks, [Online]. Available: https://www.technologynetworks.com/informatics/articles/the-kruskal-wallis-test-370025. [Accessed: Jun. 18, 2024].
[7] “Kruskal-Wallis Test,” DataTab, [Online]. Available: https://datatab.net/tutorial/kruskal-wallis-test. [Accessed: Jun. 18, 2024].